1 Introduction

There are two main ways of working in R:

The base R has been around for decades now, it was made by statistians and software programmers for statistians. R is the evolution of S, a statistical programming language. It has some S legacy (less and less every update) and didn’t count for the “new data science” since this field, even being statistics and algebra, it is a pretty new field.

https://www.tidyverse.org/ is an opinionated collection of packages design for data science. They all - except ggplot2 - follow the same grammar and allow for mixing and matching, working pretty well together.

Some criticism:

Even with all the advantages we have with this bundle of packages in my experience working with big datasets and the tidyverse can be slow, so that’s why I recommend it for exploration and testing models. I’m sure the 98% of the times you are working with data you will be safe, it is not like we are managing big datasets or big data everyday.

In this workshop we are tackling data the tidy way, some functional programming will allow us to manage and explore many models at once. This goes beyond the usual exploration - modelling to use modelling as a form of data exploration.

You could argue that a model (almost by definition) is close to your initial assumptions, that’s why the exploration step is done. Here we are going to use modelling to learn about the data, not only predictions but behavior of different groups of data within a dataset.

The outline of this exercise is to create a dataframe where there’s one row per group an all data associated with that group is stored in a column of dataframes. Then, one can apply modelling techniques to each group and use the output to learn about each group.


2 Take a look at the data

First, I’ll need to load the data, and reshape it so that it’s easier to work with.

library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(broom)

2.1 Loading Data

Depending on your computer limitations, I decided to work with only half of the observations. Nonetheless, you are welcome you use every observation or read a limited number of rows with the argument nmax in the read_csv function.

set.seed(42)
# Training data
data <- read_csv(file = "../data/clean.csv") %>% 
  sample_frac(1)
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   article = col_character(),
##   language = col_character(),
##   domain = col_character()
## )
## See spec(...) for full column specifications.

2.2 Making the data Tidy

At first glance of the data:

head(data)
## # A tibble: 6 x 806
##   article language domain X2015.07.01 X2015.07.02 X2015.07.03 X2015.07.04
##   <chr>   <chr>    <chr>        <dbl>       <dbl>       <dbl>       <dbl>
## 1 攻殼機動隊… zh       wikip…         463         435         474         444
## 2 海尔·塞拉西… zh       wikip…          75          39          37          62
## 3 Gaten_… en       wikip…          NA          NA          NA          NA
## 4 Сифилис ru       wikip…         730         784         695         611
## 5 Superm… en       wikip…        9341        9844        9818        9124
## 6 Panzer… de       wikip…           4          18          18           4
## # … with 799 more variables: X2015.07.05 <dbl>, X2015.07.06 <dbl>,
## #   X2015.07.07 <dbl>, X2015.07.08 <dbl>, X2015.07.09 <dbl>,
## #   X2015.07.10 <dbl>, X2015.07.11 <dbl>, X2015.07.12 <dbl>,
## #   X2015.07.13 <dbl>, X2015.07.14 <dbl>, X2015.07.15 <dbl>,
## #   X2015.07.16 <dbl>, X2015.07.17 <dbl>, X2015.07.18 <dbl>,
## #   X2015.07.19 <dbl>, X2015.07.20 <dbl>, X2015.07.21 <dbl>,
## #   X2015.07.22 <dbl>, X2015.07.23 <dbl>, X2015.07.24 <dbl>,
## #   X2015.07.25 <dbl>, X2015.07.26 <dbl>, X2015.07.27 <dbl>,
## #   X2015.07.28 <dbl>, X2015.07.29 <dbl>, X2015.07.30 <dbl>,
## #   X2015.07.31 <dbl>, X2015.08.01 <dbl>, X2015.08.02 <dbl>,
## #   X2015.08.03 <dbl>, X2015.08.04 <dbl>, X2015.08.05 <dbl>,
## #   X2015.08.06 <dbl>, X2015.08.07 <dbl>, X2015.08.08 <dbl>,
## #   X2015.08.09 <dbl>, X2015.08.10 <dbl>, X2015.08.11 <dbl>,
## #   X2015.08.12 <dbl>, X2015.08.13 <dbl>, X2015.08.14 <dbl>,
## #   X2015.08.15 <dbl>, X2015.08.16 <dbl>, X2015.08.17 <dbl>,
## #   X2015.08.18 <dbl>, X2015.08.19 <dbl>, X2015.08.20 <dbl>,
## #   X2015.08.21 <dbl>, X2015.08.22 <dbl>, X2015.08.23 <dbl>,
## #   X2015.08.24 <dbl>, X2015.08.25 <dbl>, X2015.08.26 <dbl>,
## #   X2015.08.27 <dbl>, X2015.08.28 <dbl>, X2015.08.29 <dbl>,
## #   X2015.08.30 <dbl>, X2015.08.31 <dbl>, X2015.09.01 <dbl>,
## #   X2015.09.02 <dbl>, X2015.09.03 <dbl>, X2015.09.04 <dbl>,
## #   X2015.09.05 <dbl>, X2015.09.06 <dbl>, X2015.09.07 <dbl>,
## #   X2015.09.08 <dbl>, X2015.09.09 <dbl>, X2015.09.10 <dbl>,
## #   X2015.09.11 <dbl>, X2015.09.12 <dbl>, X2015.09.13 <dbl>,
## #   X2015.09.14 <dbl>, X2015.09.15 <dbl>, X2015.09.16 <dbl>,
## #   X2015.09.17 <dbl>, X2015.09.18 <dbl>, X2015.09.19 <dbl>,
## #   X2015.09.20 <dbl>, X2015.09.21 <dbl>, X2015.09.22 <dbl>,
## #   X2015.09.23 <dbl>, X2015.09.24 <dbl>, X2015.09.25 <dbl>,
## #   X2015.09.26 <dbl>, X2015.09.27 <dbl>, X2015.09.28 <dbl>,
## #   X2015.09.29 <dbl>, X2015.09.30 <dbl>, X2015.10.01 <dbl>,
## #   X2015.10.02 <dbl>, X2015.10.03 <dbl>, X2015.10.04 <dbl>,
## #   X2015.10.05 <dbl>, X2015.10.06 <dbl>, X2015.10.07 <dbl>,
## #   X2015.10.08 <dbl>, X2015.10.09 <dbl>, X2015.10.10 <dbl>,
## #   X2015.10.11 <dbl>, X2015.10.12 <dbl>, …

We have 3 clean columns:

  • article
  • language
  • domain

And the other 803 are dates and number of views. To get this to be tidy we need:

  1. Every column represents one variable
  2. Every row represents one observations.

Which will gives us:

  • article name
  • language
  • date
  • number of views
# convert to long format. 
data <- data %>%
      gather(key = "date", value = "views", -article, -language, -domain)

tail(data)
## # A tibble: 6 x 5
##   article                    language domain    date        views
##   <chr>                      <chr>    <chr>     <chr>       <dbl>
## 1 Topic:T2szzp4bocj97lno     <NA>     mediawiki X2017.09.10    NA
## 2 Help:CirrusSearch/ja       <NA>     mediawiki X2017.09.10     6
## 3 Перестройка                ru       wikipedia X2017.09.10   384
## 4 Австралия_на_«Евровидении» ru       wikipedia X2017.09.10    10
## 5 Шведов,_Денис_Эдуардович   ru       wikipedia X2017.09.10   163
## 6 Елена_Стефановна           ru       wikipedia X2017.09.10    77

Now, our dataframe of 28,000 rows and ~800 columns has been reshaped to a dataframe of ~22,500,000 rows and 5 columns.

2.3 Consolidating the data

The next step is to make sure each of the columns are of the right data type.

str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    22484000 obs. of  5 variables:
##  $ article : chr  "攻殼機動隊" "海尔·塞拉西一世" "Gaten_Matarazzo" "Сифилис" ...
##  $ language: chr  "zh" "zh" "en" "ru" ...
##  $ domain  : chr  "wikipedia" "wikipedia" "wikipedia" "wikipedia" ...
##  $ date    : chr  "X2015.07.01" "X2015.07.01" "X2015.07.01" "X2015.07.01" ...
##  $ views   : num  463 75 NA 730 9341 ...

date is a character which doesn’t look right, to use it as a date we will have to remove the leading X character and convert the remaining to date. We also need to convert from character to date class but we are going to wait a bit before doing that.

data <- data %>%
  mutate(date = str_replace(pattern = "X", replacement = "", date))
  
head(data)
## # A tibble: 6 x 5
##   article             language domain    date       views
##   <chr>               <chr>    <chr>     <chr>      <dbl>
## 1 攻殼機動隊          zh       wikipedia 2015.07.01   463
## 2 海尔·塞拉西一世     zh       wikipedia 2015.07.01    75
## 3 Gaten_Matarazzo     en       wikipedia 2015.07.01    NA
## 4 Сифилис             ru       wikipedia 2015.07.01   730
## 5 Superman            en       wikipedia 2015.07.01  9341
## 6 Panzerkampfwagen_IV de       wikipedia 2015.07.01     4

2.4 Missing values

Just so we’re aware - how many values are missing? Later we are applying an ARIMA model which doesn’t do well on missing values.

7.0753291 % of the view counts are missing. We can’t be sure if this means that there were zero views for that page on that day, or if the view count data is simply missing for that day.


3 Nesting dataframes - a dataframe for every page.

Now, we transform the dataset into a dataframe such that every page + language + domain populates one row, and the corresponding data is stored in a list column. This is a three-liner with the tidyr::nest() function. A word of warning - creating list columns with the tidyr::nest() function is very slow if one of your columns is of type Date (see this Github issue for more details). As such, I will first nest the dataframes, and then convert the date column in each nested dataframe (initially a character datatype) to a native date format.

# dataframes within dataframes
nested <-  data %>%
  group_by(article, language, domain) %>% 
  nest()

head(nested)
## # A tibble: 6 x 4
##   article             language domain    data              
##   <chr>               <chr>    <chr>     <list>            
## 1 攻殼機動隊          zh       wikipedia <tibble [803 × 2]>
## 2 海尔·塞拉西一世     zh       wikipedia <tibble [803 × 2]>
## 3 Gaten_Matarazzo     en       wikipedia <tibble [803 × 2]>
## 4 Сифилис             ru       wikipedia <tibble [803 × 2]>
## 5 Superman            en       wikipedia <tibble [803 × 2]>
## 6 Panzerkampfwagen_IV de       wikipedia <tibble [803 × 2]>

Let’s take a look to the first observation:

nested %>% 
  head(1) %>% 
  select(data) %>% 
  .$data
## [[1]]
## # A tibble: 803 x 2
##    date       views
##    <chr>      <dbl>
##  1 2015.07.01   463
##  2 2015.07.02   435
##  3 2015.07.03   474
##  4 2015.07.04   444
##  5 2015.07.05   415
##  6 2015.07.06   378
##  7 2015.07.07   379
##  8 2015.07.08   378
##  9 2015.07.09   321
## 10 2015.07.10   427
## # … with 793 more rows

Now we can transform the date to a date type.

nested <- nested %>%
      mutate(data = map(.f = ~mutate(., date = as.POSIXct(date, 
                                                       format = "%Y.%m.%d"))
                        , .x = data))

We have everything like we wanted, we can delete our first dataframe to free some memory.

head(nested %>% head(1) %>% .$data)
## [[1]]
## # A tibble: 803 x 2
##    date                views
##    <dttm>              <dbl>
##  1 2015-07-01 00:00:00   463
##  2 2015-07-02 00:00:00   435
##  3 2015-07-03 00:00:00   474
##  4 2015-07-04 00:00:00   444
##  5 2015-07-05 00:00:00   415
##  6 2015-07-06 00:00:00   378
##  7 2015-07-07 00:00:00   379
##  8 2015-07-08 00:00:00   378
##  9 2015-07-09 00:00:00   321
## 10 2015-07-10 00:00:00   427
## # … with 793 more rows
rm(data)

4 Before we start modeling… what’s the data look like?

The point of this notebook is to show how to use modeling for data exploration. But even so - I can’t jump into modeling without knowing what the data looks like at the least.

I’m most interested in identifying structure in the data that will influence my choices of parameters in models to come, or which models to use. For example, I’d be interested in identifying any trend the pages could have. I’d also be interested in identifying any outliers or extreme values. This may motivate me to transform my data in some way, or use models that are robust to extreme values.

Some interesting questions are:

  1. What trend can we observe in the page views?
  2. How variable are the page views across the different pages?
  3. How many pages have very extreme page view counts?

4.1 Taking a look at the languages

nested %>%
  group_by(language) %>% 
  summarise(n = n()) %>%
  mutate(freq = n / sum(n) * 100) %>% 
  filter(freq > 1) %>% 
  drop_na() %>% 
  ggplot(aes(x = language, y = sort(freq), fill = language)) +
  geom_col() +
  ylab("Proportion of pages with language") + 
  theme(legend.position = "")

Using this extraction method, we can see those languages with more than 1% of total pages (droping NAs):

  1. German (de)
  2. English (en)
  3. French (fr)
  4. Russian (ru)
  5. Chinese (zh)

The remaining are other, we can classify now and make our lives easier.

nested <- nested %>%
  ungroup() %>% 
  mutate(language = ifelse(
    language %in% c("de", "en", "fr", "ru", "zh", NA), language, "other"))

4.2 What the domain distribution looks like

nested %>%
  group_by(domain) %>% 
  summarise(n = n()) %>%
  mutate(freq = n / sum(n) * 100) %>% 
  drop_na() %>% 
  ggplot(aes(x = domain, y = sort(freq), fill = domain)) +
  geom_col() +
  ylab("Proportion of pages with language") + 
  theme(legend.position = "")

This seems to be working well. Now we can compare articles across different languages.

5 Modelling a linear trend

Now for the fun stuff - mapping models onto the nested data. As a first step, I will try and model the trend of the view count using simple linear regression. Then, looking at measures of model quality such as the \(R^2\), I can see which series are well explained with a linear trend, and which have more complex changes in mean.

To apply a linear model to each of the nested dataframes, I’ll first design a function that takes in a dataframe, and applies simple linear regression onto it. After, mapping this function onto each of the nested dataframes, we can get a new column, linear_trend, which stores linear models, fit onto each corresponding nested dataframe:

# a function for fitting SLR to an inptut dataframe
apply_lm <- function(df){
      lm(data = df, views ~ date)
}

# fit a linear model to each page
nested <-  nested %>%
      mutate(linear_trend = map(.f = apply_lm, .x = data))

# isolate the first page
first_page <- head(nested, 1)

Now, along with a list column of the data for each page in a column, we also have a fitted linear model object stored in a seperate column for each wikipedia page:

nested %>%
  head() %>%
  select(article, data, linear_trend)
## # A tibble: 6 x 3
##   article             data               linear_trend
##   <chr>               <list>             <list>      
## 1 攻殼機動隊          <tibble [803 × 2]> <lm>        
## 2 海尔·塞拉西一世     <tibble [803 × 2]> <lm>        
## 3 Gaten_Matarazzo     <tibble [803 × 2]> <lm>        
## 4 Сифилис             <tibble [803 × 2]> <lm>        
## 5 Superman            <tibble [803 × 2]> <lm>        
## 6 Panzerkampfwagen_IV <tibble [803 × 2]> <lm>

For example, if we wanted to see the summary of the first linear model fit:

nested %>% head(1) %>% .$linear_trend 
## [[1]]
## 
## Call:
## lm(formula = views ~ date, data = df)
## 
## Coefficients:
## (Intercept)         date  
##  -4.638e+04    3.235e-05

It’d be interesting to store a measure of model quality for each of these linear models - namely the \(R^2\) statistic. This will be helpful, as looking at each model’s \(R^2\) will help us highlight which Wikipedia pages exhibit clear linear trend, and which don’t (this might be hard to determine otherwise - can you think of a good way to do so?)

I’ll define a function extract_r2 - which uses the broom function to extract the \(R^2\) of a linear model. I’ll then map this function onto nested lm models to store the \(R^2\) for each model:

# a function for extracting only the R-squared statistics
extract_r2 <- function(model){
  glance(model)$r.squared
}

# map this function onto each model to store the R^2
nested <- nested %>%
  mutate(lm.r.squared = purrr::map_dbl(.f = extract_r2, .x = linear_trend))

Looking at the distribution of \(R^2\) across the different Wikipedia pages:

nested %>%
  ggplot(aes(x = lm.r.squared)) + 
  geom_density()
## Warning: Removed 1 rows containing non-finite values (stat_density).

Most of the time series can not be explained well by a linear model, leading to low \(R^2\).

Some models have suspiciously high \(R^2\) values - I suspect this is because most of the data is missing, and thus a linear model can fit these sparse data more effectively. To test this hypothesis, I can plot the model with the highest \(R^2\):

# a funtion for plotting a time series, with a fitted linear trend line on top of it
plot_linear_trend <- function(df, title){
  df %>%
    ggplot(aes(x = date, y = views, color = views)) +
    geom_point() + 
    geom_line() + 
    geom_smooth(method = "lm", se = FALSE) + 
    labs(title = title) + 
    theme(legend.position = "")
}

nested %>%
  arrange(desc(lm.r.squared)) %>%
  head(1) %>%
  mutate(chart = map2(.f = plot_linear_trend, .x = data, .y = article)) %>%
  .$chart
## [[1]]
## Warning: Removed 801 rows containing non-finite values (stat_smooth).
## Warning: Removed 801 rows containing missing values (geom_point).
## Warning: Removed 801 rows containing missing values (geom_path).

Indeed - the model with the highest \(R^2\) has only 2 non-missing points - which can be fitted perfectly by a line.

But if we skip the pages with the 50 highest \(R^2\) values, we can really see that these models have a roughly linear trend:

nested %>%
      arrange(desc(lm.r.squared)) %>%
      filter(dplyr::between(row_number(), 50, 55)) %>%
      mutate(chart = purrr::map2(.f = plot_linear_trend, 
                                 .x = data, .y = article)) %>%
      .$chart
## [[1]]
## Warning: Removed 380 rows containing non-finite values (stat_smooth).
## Warning: Removed 380 rows containing missing values (geom_point).
## Warning: Removed 378 rows containing missing values (geom_path).

## 
## [[2]]
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_path).

## 
## [[3]]
## Warning: Removed 327 rows containing non-finite values (stat_smooth).
## Warning: Removed 327 rows containing missing values (geom_point).
## Warning: Removed 327 rows containing missing values (geom_path).

## 
## [[4]]
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_path).

## 
## [[5]]
## Warning: Removed 330 rows containing non-finite values (stat_smooth).
## Warning: Removed 330 rows containing missing values (geom_point).
## Warning: Removed 330 rows containing missing values (geom_path).

## 
## [[6]]

Now, looking at the model with the lowest \(R^2\):

nested %>%
      arrange(lm.r.squared) %>%
      .[1,] %>%
       mutate(chart = purrr::map2(.f = plot_linear_trend, 
                                  .x = data, .y = article)) %>%
      .$chart
## [[1]]
## Warning: Removed 802 rows containing non-finite values (stat_smooth).
## Warning: Removed 802 rows containing missing values (geom_point).
## Warning: Removed 802 rows containing missing values (geom_path).
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?

Now, looking at plots of series which don’t have extraordinarily high or low \(R^2\) (to avoid series with mostly missing values), we find some series that truly exhibit non-linear trends, resulting in low \(R^2\):

nested %>%
      arrange(desc(lm.r.squared)) %>%
      filter(dplyr::between(row_number(), 1000, 1005)) %>%
      mutate(chart = purrr::map2(.f = plot_linear_trend, 
                                 .x = data, .y = article)) %>%
      .$chart
## [[1]]

## 
## [[2]]
## Warning: Removed 60 rows containing non-finite values (stat_smooth).
## Warning: Removed 60 rows containing missing values (geom_point).
## Warning: Removed 38 rows containing missing values (geom_path).

## 
## [[3]]

## 
## [[4]]

## 
## [[5]]

## 
## [[6]]